Reload new config without restarting process #2716

ganmacs · 2019-12-09T08:54:29Z

Which issue(s) this PR fixes:
Fixes #2624

What this PR does / why we need it:

This change makes it possible for fluentd to reload a new config without restarting the process, triggered by the SIGUSR2 signal.

It is much lighter and safer than the existing reload feature, but it has two limitations:

1. A change to system_config is ignored, because applying it requires restarting (kill/spawn) the process.
2. Plugins must not use class variables, because the process is not restarted and class-level state therefore survives the reload.
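
The second limitation can be illustrated with a minimal Ruby sketch. `CountingPlugin` is a hypothetical class used only for illustration, not a Fluentd plugin, and "reload" is modelled simply as creating a new instance inside the same process:

```ruby
# Hypothetical plugin used only for illustration; not a real Fluentd class.
class CountingPlugin
  @@configure_calls = 0          # class-level state: shared by every instance

  def initialize
    @configure_calls = 0         # instance-level state: fresh for each instance
  end

  def configure
    @@configure_calls += 1
    @configure_calls += 1
    self
  end

  def class_level_calls
    @@configure_calls
  end

  def instance_level_calls
    @configure_calls
  end
end

# Model "reload without restarting the process": discard the old instance
# and build a new one inside the same Ruby process.
before_reload = CountingPlugin.new.configure
after_reload  = CountingPlugin.new.configure

puts after_reload.instance_level_calls  # 1: instance state was reset
puts after_reload.class_level_calls     # 2: the class variable kept its old value
```

A full process restart would reset both counters, which is why this limitation applies only to the in-process reload.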

Docs Changes:

Add USR2 description.
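
As a rough illustration of the signal-driven flow (not the code in this PR), the sketch below traps USR2 and swaps a config object in place while the PID stays the same. The `config` hash and its `version` key are assumptions made for the example, and the process signals itself as a stand-in for an operator running `kill -USR2 <pid>`:

```ruby
# Stand-in for a long-running worker; this is not Fluentd code.
reload_requested = false

# Keep the handler tiny: only record that a reload was requested.
Signal.trap('USR2') { reload_requested = true }

config = { 'version' => 1 }
pid_before = Process.pid

# Stand-in for `kill -USR2 <pid>` issued from a shell.
Process.kill('USR2', Process.pid)

# Main loop: wait (at most 5 seconds) until the signal has been delivered.
deadline = Time.now + 5
sleep 0.01 until reload_requested || Time.now > deadline

if reload_requested
  config = { 'version' => 2 }    # pretend this was re-read from disk
  reload_requested = false
end

puts "same process: #{pid_before == Process.pid}, config: #{config}"
```

This only runs on platforms that provide USR2.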

Release Note:

same as title


cosmo0920

I feel StaticConfigAnalysis gives us special powers to report misconfiguration during the checking phase. 💪
I've added two comments for small concerns.
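
The general idea of checking a config before applying it can be sketched as follows. Every name here (`ConfigError`, `validate!`, `reload`, the hash layout) is hypothetical and is not the `StaticConfigAnalysis` API; the point is only that a failed check leaves the running config untouched:

```ruby
# Hypothetical names throughout; this is not the StaticConfigAnalysis API.
ConfigError = Class.new(StandardError)

# Static check: inspect the parsed config without starting any plugin.
def validate!(conf)
  sources = conf.fetch(:sources, [])
  raise ConfigError, 'at least one source is required' if sources.empty?
  sources.each do |s|
    raise ConfigError, "source is missing a type: #{s.inspect}" unless s[:type]
  end
  conf
end

# Swap in the candidate only if it passes the check; otherwise keep the
# current config and return the error message alongside it.
def reload(current, candidate)
  validate!(candidate)
  [candidate, nil]
rescue ConfigError => e
  [current, e.message]
end

running = { sources: [{ type: 'forward' }] }

running, error = reload(running, { sources: [{ port: 24224 }] })
puts "kept old config: #{error}" if error

running, error = reload(running, { sources: [{ type: 'tcp' }] })
puts "now running: #{running}" unless error
```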


ganmacs · 2019-12-23T02:12:02Z

@repeatedly ?


repeatedly · 2020-01-06T01:41:04Z

I noticed we need to update HTTP RPC for this feature.
Should we change this to use USR2 or add new path?

fluentd/lib/fluent/supervisor.rb

Line 106 in c809788

@rpc_server.mount_proc('/api/config.reload') { |req, res|

cosmo0920 · 2020-01-06T02:03:17Z

> Should we change this to use USR2 or add new path?

I believe that we should add a new path for reloading config.
Windows does not have the USR2 signal:

https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/signal?view=vs-2019


ganmacs · 2020-01-06T05:01:40Z

> I noticed we need to update HTTP RPC for this feature.
> Should we change this to use USR2 or add new path?

Good catch! I forgot about it. I added a new endpoint, /api/config.gracefulReload (d8b8bc6), for compatibility.

> Windows does not have USR2 signal:

I think this feature can't support Windows, as with USR1, because the supervisor needs to send the USR2 signal to the workers to do this. https://github.com/fluent/fluentd/pull/2716/files#diff-dbe0e1ec4079138e48ca6a4d7c7248f9R215
To use this feature on Windows, we need to introduce another way for the supervisor and workers to communicate without using a signal.
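
For reference, calling the new endpoint from Ruby might look like the sketch below. The host and port (`127.0.0.1:24444`) are assumptions standing in for whatever `rpc_endpoint` is configured to, and the two helper methods are hypothetical names, not part of Fluentd:

```ruby
require 'net/http'
require 'uri'

# Assumption: the RPC server listens on 127.0.0.1:24444; use the address
# that rpc_endpoint is actually set to.
def graceful_reload_uri(host: '127.0.0.1', port: 24444)
  URI::HTTP.build(host: host, port: port, path: '/api/config.gracefulReload')
end

# Send a GET request and return the HTTP status code as a string.
def request_graceful_reload(uri = graceful_reload_uri)
  Net::HTTP.get_response(uri).code
end

puts graceful_reload_uri
# With a Fluentd instance running and its RPC endpoint enabled:
#   request_graceful_reload
```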

repeatedly · 2020-01-07T05:22:01Z

I will release v1.9.0.rc1 soon.

This replaces the current `GracefulReload` (`SIGUSR2`) (fluent#2716) with the new feature on non-Windows:

* Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`.

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload.
      * (Leave the traditional way, just in case)
  * RPC: `/api/config.gracefulReload`
* This starts the new supervisor and workers with zero downtime for some plugins.
  * Input plugins with `zero_downtime_restart` supported work in parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
* The new supervisor works in `source-only` mode (fluent#4661) until the old processes stop.
  * After the old processes stop, the data handled by the new processes are loaded and processed.
  * If need, you can configure `source_only_buffer` (see fluent#4661).
* Windows: Not affected at all. Remains the traditional GracefulReload.

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to zero_downtime_restart_ready? input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Note: need these feature

* fluent#4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider.

The appropriateness of replacing the traditional GracefulReload:

* The traditional GracefulReload feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because it needs to restart(kill/spawn) process.
    2. All plugins must not use class variable when restarting.
  * Issues:
    * fluent#2259
    * fluent#3469
    * fluent#3549
* This new feature allows restarts without downtime and such limitations.
  * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss).
* This new feature has a big advantage that it can also be used to update Fluentd.
  * In the future, fluent-package will use this feature to allow update with zero downtime by default.
* If needed, we can still use the traditional feature by directly sending `SIGUSR2` to the workers.

Co-authored-by: Shizuo Fujita <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>



This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows).

* Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`.

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload.
      * (Leave the traditional way, just in case)
  * RPC: `/api/processes.zeroDowntimeRestart`
    * Leave `/api/config.gracefulReload` for the traditional feature.
* This starts the new supervisor and workers with zero downtime for some plugins.
  * Input plugins with `zero_downtime_restart` supported work in parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
* The new supervisor works in `source-only` mode (#4661) until the old processes stop.
  * After the old processes stop, the data handled by the new processes are loaded and processed.
  * If need, you can configure `source_only_buffer` (see #4661).
* Windows: Not affected at all. Remains the traditional GracefulReload.

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to zero_downtime_restart_ready? input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Note: need these feature

* #4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider.

The appropriateness of replacing the traditional SIGUSR2:

* The traditional SIGUSR2 feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because it needs to restart(kill/spawn) process.
    2. All plugins must not use class variable when restarting.
  * Issues:
    * #2259
    * #3469
    * #3549
* This new feature allows restarts without downtime and such limitations.
  * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss).
* This new feature has a big advantage that it can also be used to update Fluentd.
  * In the future, fluent-package will use this feature to allow update with zero downtime by default.
* If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers.

Signed-off-by: Daijiro Fukuda <[email protected]>
Co-authored-by: Shizuo Fujita <[email protected]>
Co-authored-by: Kentaro Hayashi <[email protected]>
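
The role of `zero_downtime_restart_ready?` in step 4 of the mechanism can be sketched with stand-in classes. Only the method name comes from the commit message above; the class names and the `inputs_for_source_only` helper are hypothetical, and the real Fluentd plugin classes are not loaded here:

```ruby
# Stand-in classes; the real Fluentd plugin classes are not loaded here.
class InputSketch
  # Default: a plugin is not assumed to be safe to run in parallel
  # with another instance.
  def zero_downtime_restart_ready?
    false
  end
end

class UdpInputSketch < InputSketch
  # Server-based input: its listening socket can be shared with the
  # new instance, so it opts in.
  def zero_downtime_restart_ready?
    true
  end
end

class TailInputSketch < InputSketch
  # Position-managing input: keeps the default (false).
end

# Pick the inputs that new workers would start while the old processes
# are still running.
def inputs_for_source_only(inputs)
  inputs.select(&:zero_downtime_restart_ready?)
end

inputs = [UdpInputSketch.new, TailInputSketch.new]
puts inputs_for_source_only(inputs).map { |i| i.class.name }.inspect
```

Making the default `false` means a plugin has to opt in explicitly, which matches the condition that it "must be able to work in parallel with another Fluentd instance".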

ganmacs force-pushed the light-reload branch 4 times, most recently from 620e7b0 to 7b64af2, December 11, 2019 09:21

ganmacs added the enhancement label Dec 16, 2019

ganmacs force-pushed the light-reload branch from 7b64af2 to e3dee30, December 18, 2019 06:59

ganmacs added 8 commits December 18, 2019 16:29

Move method building config to Fluent::Config's class method

7e1a811

Signed-off-by: Yuta Iwama <[email protected]>

Assign ivar to make fluentd_conf changable

b70bd2b

Signed-off-by: Yuta Iwama <[email protected]>

Add new class variable store to control all state

b2d49a8

Signed-off-by: Yuta Iwama <[email protected]>

light reload

c0bbf66

Signed-off-by: Yuta Iwama <[email protected]>

remove needless require

2fe2e9d

Signed-off-by: Yuta Iwama <[email protected]>

block nil

e6a38ed

Signed-off-by: Yuta Iwama <[email protected]>

Move reading config test

ffbf1c5

Signed-off-by: Yuta Iwama <[email protected]>

Merge duplicated logic

f1c23e0

Signed-off-by: Yuta Iwama <[email protected]>

ganmacs force-pushed the light-reload branch 2 times, most recently from 434d98d to ce616a0, December 18, 2019 09:00

ganmacs changed the title ~~Be able to reload new config without restarting process~~ Reload new config without restarting process Dec 18, 2019

ganmacs requested review from repeatedly and cosmo0920 December 18, 2019 09:06

ganmacs force-pushed the light-reload branch from 4ea8f1c to 752c83f, December 18, 2019 09:12

ganmacs added 3 commits December 18, 2019 18:35

Add StaticConfigAnalysis

137823f

to check if config is valid before invoking configure Signed-off-by: Yuta Iwama <[email protected]>

Add test for reload config

1bd4889

Signed-off-by: Yuta Iwama <[email protected]>

Error raised in Thread is handled by myself

ae605ea

Signed-off-by: Yuta Iwama <[email protected]>

ganmacs force-pushed the light-reload branch from 752c83f to ae605ea, December 18, 2019 09:35

cosmo0920 reviewed Dec 19, 2019

lib/fluent/engine.rb (outdated, resolved)

lib/fluent/engine.rb (outdated, resolved)

ganmacs added 3 commits December 19, 2019 16:42

Mock correct method

6e3b5ab

because replacing Supervisor#read_config with Fluent::Config.build Signed-off-by: Yuta Iwama <[email protected]>

Add plugin name and plugin id to identify a plugin name

a78f7de

Signed-off-by: Yuta Iwama <[email protected]>

Extract as a method

37f64c6

Signed-off-by: Yuta Iwama <[email protected]>

cosmo0920 approved these changes Dec 20, 2019


repeatedly reviewed Dec 26, 2019

lib/fluent/plugin/buf_file.rb (outdated, resolved)

lib/fluent/static_config_analysis.rb (outdated, resolved)

test/test_config.rb (outdated, resolved)

ganmacs added 2 commits December 27, 2019 11:28

Fix name name and adding space

500b025

Signed-off-by: Yuta Iwama <[email protected]>

Filter is correct...

60440d5

Signed-off-by: Yuta Iwama <[email protected]>

ganmacs self-assigned this Jan 6, 2020

Add rpc endpoint to do light reload

d8b8bc6

Signed-off-by: Yuta Iwama <[email protected]>

repeatedly merged commit 2800465 into fluent:master Jan 7, 2020

ganmacs deleted the light-reload branch April 2, 2020 05:25

daipom mentioned this pull request Mar 8, 2023

Refactor logger initialization #4065

Merged

daipom mentioned this pull request Nov 25, 2024

SIGUSR2: zero downtime restart #4624

Merged
